
Reimplementation of end symbol (Issue #237) #880

Open · wants to merge 3 commits into base: master

Conversation

@erezsh
Member

erezsh commented Apr 18, 2021

No description provided.

@MegaIng
Member

MegaIng left a comment

Do you remember what the problem was last time you tried to implement this? And is it fixed?

        for token in self.lexer_state.lex(self.parser_state):
            self.parser_state.feed_token(token)

    def feed_eof(self, last_token=None):
        """Feed a '$END' Token. Borrows from 'last_token' if given."""
        eof = Token.new_borrow_pos('$END', '', last_token) if last_token is not None else Token('$END', '', 0, 1, 1)
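The borrow-position behavior in feed_eof can be sketched independently of lark with a minimal stand-in Token class. All names here are illustrative (a simplified stand-in, not lark's exact Token implementation):

```python
class Token(str):
    """Minimal stand-in for lark's Token: a string carrying position metadata."""
    def __new__(cls, type_, value, start_pos=None, line=None, column=None):
        inst = super().__new__(cls, value)
        inst.type = type_
        inst.start_pos = start_pos
        inst.line = line
        inst.column = column
        return inst

    @classmethod
    def new_borrow_pos(cls, type_, value, borrow_t):
        """Create a new token that inherits position info from an existing one."""
        return cls(type_, value, borrow_t.start_pos, borrow_t.line, borrow_t.column)

# The synthetic '$END' token reports the position of the last real token,
# so error messages point at the end of the input instead of (0, 1, 1):
last = Token('NAME', 'foo', start_pos=10, line=2, column=5)
eof = Token.new_borrow_pos('$END', '', last)
print(eof.type, eof.line, eof.column)
```

This mirrors why the snippet above falls back to `Token('$END', '', 0, 1, 1)` only when there is no last token to borrow from.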
Member
Sorry, overlooked this in the review: Shouldn't we feed multiple ENDs here as well?

Member Author
Hmm I think you're right.

@erezsh
Member Author

erezsh commented Apr 18, 2021

I don't remember the problem, only that it messed some grammars up. But I changed the LALR logic since, made it a little cleaner, so maybe that was solved on the way?

I'll make sure to do thorough tests before merging it in.

@ThatXliner
Contributor

I'm trying to test this right now. What's the terminal to use? $END?

@MegaIng
Member

MegaIng commented Apr 18, 2021

No, it is just $

@ThatXliner
Contributor

Maybe I'm going too far but is this supposed to work with custom Indenters? Because I'm facing a problem where

lark.exceptions.UnexpectedToken: Unexpected token Token('__$END$__', '') at line 1, column 1.

@MegaIng
Member

MegaIng commented Apr 18, 2021

It is supposed to work. Can you give a short example script?

@erezsh
Member Author

erezsh commented Apr 18, 2021

on_error was broken before, but now I fixed it. So if you were using it, you should try again with the latest commit.

@ThatXliner
Contributor

ThatXliner commented Apr 20, 2021

Now it's just really slow. Or maybe it's just my grammar....

EDIT: Ok, never mind. Apparently, it was kinda hanging when I did

(value (_NL|$|_DEDENT)+ )

but removing _DEDENT works...

(I'm making a programming language-type thing. It's whitespace sensitive)

SECOND EDIT:

So I'm trying to make an error and there we go:

Traceback (most recent call last):
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 127, in feed_token
    action, arg = states[state][token.type]
KeyError: 'IDENTIFIER'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/bryanhu/.pyenv/versions/3.7.9/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/bryanhu/.pyenv/versions/3.7.9/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/bryanhu/projects/langs/tylaireum/tylaireum/__main__.py", line 79, in <module>
    print(parser.grammar.parse(INLINE).pretty())
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/lark.py", line 552, in parse
    return self.parser.parse(text, start=start, on_error=on_error)
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parser_frontends.py", line 107, in parse
    return self.parser.parse(stream, start, **kw)
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 42, in parse
    return self.parser.parse(lexer, start)
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 176, in parse
    return self.parse_from_state(parser_state)
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 193, in parse_from_state
    raise e
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 184, in parse_from_state
    state.feed_token(token)
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 130, in feed_token
    raise UnexpectedToken(token, expected, state=self, interactive_parser=None)
^Clark.exceptions.UnexpectedToken: <exception str() failed>

But as you can see, I had to CTRL-C to stop it. It was just hanging

I am too lazy to send a minimal reproducible example right now, so if you want my grammar, etc, I'll send the whole enchilada.

@erezsh
Member Author

erezsh commented Apr 20, 2021

@ThatXliner Thanks for finding that. I think I fixed it, so try now.

@ThatXliner
Contributor

ThatXliner commented Apr 20, 2021

nope.

EDIT: Though I noticed it wasn't hanging for valid input this time

@erezsh
Member Author

erezsh commented Apr 20, 2021

@ThatXliner Ok then, hit me with your enchilada. But make sure it's easy to see it fail.

@ThatXliner
Contributor

Probably it was hanging for valid input when I sent my last comment, but I most likely didn't double-check. Now, when I make it valid, it hangs.

Here's the enchilada: https://gist.github.com/ThatXliner/96829ce180b5659198a822cf74b6c309

Tell me if you can reproduce or not

@erezsh
Member Author

erezsh commented Apr 20, 2021

I found a minimal breaking example:

from lark import Lark

p = Lark("""
    ?start: "foo" x "bar"
    x: $ x
""", parser="lalr")

p.parse("foo")

The infinite loop happens because $ is accepted forever, but has no valid route to unwind the stack.

I need to think about how to solve this. My current intuition is that this grammar shouldn't even be allowed. But I'm not sure how easy it is to detect reliably.

But maybe I just need to rethink the loop mechanism, and solve it by only providing the end token once.
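The guard erezsh describes (only providing the end token once) can be expressed as a hedged sketch over a hypothetical parser-state interface. `ScriptedState` and its string actions are toy stand-ins, not lark's real ParserState:

```python
class EndLoopError(Exception):
    """Raised when the parser keeps shifting '$END' and can never complete."""

class ScriptedState:
    """Toy stand-in for a parser state: replays a fixed sequence of actions."""
    def __init__(self, actions):
        self._actions = iter(actions)
    def feed_token(self, token):
        return next(self._actions)

def feed_eof_until_accept(state, max_end_shifts=1):
    """Feed '$END' until the parser accepts, but bail out if it keeps shifting.

    A grammar like `x: $ x` would shift '$END' forever; capping the number of
    END shifts turns the infinite loop into a runtime error.
    """
    shifts = 0
    while True:
        if state.feed_token('$END') == 'accept':
            return shifts
        shifts += 1
        if shifts > max_end_shifts:
            raise EndLoopError("parser keeps shifting $END; it can never complete")

# A well-behaved grammar shifts END once, then accepts:
print(feed_eof_until_accept(ScriptedState(['shift', 'accept'])))  # 1
# A grammar like `x: $ x` shifts END forever; the guard catches it:
try:
    feed_eof_until_accept(ScriptedState(['shift'] * 100))
except EndLoopError as e:
    print('caught:', e)
```

This only converts the hang into an error at parse time; as discussed below, the nicer fix would reject such grammars at compile time.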

@ThatXliner
Contributor

I'm wondering how you're implementing the end token right now. I think it's better to parse normally, get an EOF error, check whether the grammar expects it, and if so, continue parsing. Unless that's what you're doing already.

@erezsh
Member Author

erezsh commented Apr 20, 2021

So, it seems that this can be "solved" by testing that the parser only shifts once on END, and otherwise throwing a runtime error for the infinite loop.

But I don't like it. Lark shouldn't allow compiling a grammar that can never run to completion. But I'm not sure how to detect that at compile time.

@@ -132,7 +133,7 @@ def feed_token(self, token, is_end=False):

         if action is Shift:
             # shift once and return
-            assert not is_end
+            # assert not is_end
Contributor

Wonder what would happen if you undid this

Member Author

It's just an assert. It doesn't matter.

@seimit

seimit commented Jun 1, 2021

Hi
Do you have plan this PR to be merged into the master soon?
Regards!

@erezsh
Member Author

erezsh commented Jun 1, 2021

@seimit There are issues with this PR, so currently no.

The problem is that in some edge cases, the parser gets stuck in a loop forever. Which is something I never want Lark to be able to do.

@MegaIng
Member

MegaIng commented Sep 30, 2021

@erezsh Shouldn't it be possible, at some early point, to reject all BNF rules that have a terminal following a $ symbol? E.g. if we find a rule of the form

a: $ A

we completely throw it away, since it can never actually match.

For rules of the form

a: $ b

we need to check whether b can be empty, and if it cannot, we throw the rule away. I think we already check during grammar analysis whether a rule can be empty.
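This check can be sketched as a single pass over BNF rules. The representation here is hypothetical (rules as `(lhs, rhs)` tuples, `'$END'` marking the end symbol, and a precomputed set of nullable nonterminals), not lark's internal rule format:

```python
def reject_unmatchable_rules(rules, nullable):
    """Drop rules where anything non-nullable follows '$END'.

    rules: list of (lhs, rhs) pairs, where rhs is a tuple of symbol names.
    nullable: set of nonterminals that can derive the empty string.
    """
    kept = []
    for lhs, rhs in rules:
        if '$END' in rhs:
            after = rhs[rhs.index('$END') + 1:]
            # Every symbol after $END must be a nullable nonterminal;
            # a terminal or non-nullable nonterminal there can never match.
            if not all(sym in nullable for sym in after):
                continue
        kept.append((lhs, rhs))
    return kept

rules = [
    ('a', ('$END', 'A')),      # terminal after $ -> unmatchable, dropped
    ('a', ('$END', 'b')),      # b is nullable -> kept
    ('start', ('X', '$END')),  # nothing after $ -> kept
]
print(reject_unmatchable_rules(rules, nullable={'b'}))
```

The sketch only inspects the first occurrence of `$END` in a rule; a real implementation would handle repeated occurrences and indirection through other rules, which is exactly the `a: end A` case raised below.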

@erezsh
Member Author

erezsh commented Sep 30, 2021

It's a good point that we can detect it during grammar analysis. And then we can just throw an error if anything follows the $END symbol. I don't think there's a good enough reason to allow them, empty rules or not.

@MegaIng
Member

MegaIng commented Sep 30, 2021

Yes there is. ($|_NEWLINE)* should be allowed. I think that would break if we don't allow it.

@MegaIng
Member

MegaIng commented Sep 30, 2021

OTOH, maybe it shouldn't be allowed. In effect it is the same as _NEWLINE* $?. I am not sure. But we need to make sure that, for example, the Python grammar works when everywhere we accept a statement-ending _NEWLINE we also accept a $.

But also, can we somehow detect this:

a: end A
end: $

?

@erezsh
Member Author

erezsh commented Sep 30, 2021

If we do it at the grammar analysis level, we can make sure FOLLOWS[$] is empty. That should take care of all edge cases (I think)
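The FOLLOW-based test can be demonstrated with a standard textbook FOLLOW-set computation over a toy grammar. This is a self-contained sketch (no epsilon handling, not lark's actual GrammarAnalyzer): for the breaking grammar `x: $ x` above, FOLLOW($END) picks up FIRST(x), so it is non-empty and the grammar would be rejected.

```python
def compute_follow(rules, terminals):
    """Compute FOLLOW sets, including for terminals, so FOLLOW($END) is inspectable.

    rules: list of (lhs, rhs) pairs; no epsilon rules are handled in this sketch.
    """
    # FIRST sets: a terminal's FIRST is itself; a rule contributes FIRST(rhs[0]).
    first = {t: {t} for t in terminals}
    for lhs, _ in rules:
        first.setdefault(lhs, set())
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if rhs and not first[lhs] >= first[rhs[0]]:
                first[lhs] |= first[rhs[0]]
                changed = True
    # FOLLOW sets: FIRST of the next symbol, or FOLLOW(lhs) at the end of a rule.
    follow = {s: set() for s in first}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            for i, sym in enumerate(rhs):
                add = first[rhs[i + 1]] if i + 1 < len(rhs) else follow[lhs]
                if not follow[sym] >= add:
                    follow[sym] |= add
                    changed = True
    return follow

# The minimal breaking grammar from above:  ?start: "foo" x "bar"   x: $ x
rules = [
    ('start', ('FOO', 'x', 'BAR')),
    ('x', ('$END', 'x')),
]
follow = compute_follow(rules, {'FOO', 'BAR', '$END'})
print(follow['$END'])  # non-empty: contains FIRST(x)
```

Under the proposed check, a non-empty FOLLOW($END) would make grammar compilation fail instead of letting the parser loop at runtime.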

@erezsh
Member Author

erezsh commented Sep 30, 2021

Hmm actually, no, that's the crux of the problem, which I guess is why I left it open.

On one hand, the grammar has to allow the option of something after an end token, otherwise, there's no advantage in implementing it. On the other hand, it's not obvious how to block the bad uses while allowing the good ones.

@MegaIng
Member

MegaIng commented Sep 30, 2021

The point is more that it must be possible to accept nothing after $. I'm not sure how that can be expressed.

@lark-parser lark-parser deleted a comment from daniel-leicht Nov 29, 2022
@Erotemic
Contributor

Throwing in my $0.02, just to say that this feature would be useful to me.
